ABSTRACT

The research was initiated to investigate
whether item-sampling as a procedure would yield a more accurate and 
stable index of student achievement during formative evaluation when 
compared to indices arrived at by the traditional method of assessing 
pupil knowledge and understandings within the framework of multiple 
choice testing for student evaluation. Results have indicated that
item-sampling as a method for measuring classroom achievement
provides no more precise information than tests of the same length
constructed in the traditional manner. It was shown that
item-sampling can be employed for classroom assessment without the 
fear that perhaps the procedure itself would detract from some estimate
of an individual's performance. The research has demonstrated that 
item-sampling can provide feedback to the instructor over a greater 
range of content objectives within the same time limits that 
typically provide for a narrower sampling of course related 
objectives by way of traditional test construction. It was also shown 
that item-sampling, in addition to covering a greater range of 
content objectives, can do so with a fewer number of items per test 
without losing predictive power. (Author) 
Item-sampling (matrix sampling) is defined as a procedure whereby a set
of test items is randomly divided into k subsets of items. A population of
subjects is likewise identified and randomly separated into k samples. Each of
the k samples of subjects then receives one of the random subsets of items.
This process was first introduced by Lord (1962) as a viable procedure for the
estimation of test norms. Subsequent research by other investigators (Cahen,
Romberg, & Zwirner, 1970; Cook & Stufflebeam, 1967; Lord & Novick, 1968;
Plumlee, 1964; Shoemaker, 1970) has provided a wealth of supportive data
regarding the utility of this model for estimating such norms. Item-sampling
research has also been conducted to investigate the implications for context
effects (Sirotnik, 1970), methods for estimating reliability and standard
errors of item-sampled tests (Zimmerman, 1969; Shoemaker, 1970), and its
feasibility in the collection of data when measuring attitudes (Peterson &
Anderson, 1971; Shoemaker, 1971).
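
The definition at the start of this section can be sketched in a few lines of Python. This is only an illustration of the general procedure, not the routine used in any of the cited studies; the function name `matrix_sample`, the pool of 60 items, the 30 subjects, and k = 6 are all hypothetical.

```python
import random

def matrix_sample(items, subjects, k, seed=None):
    """Randomly split items into k subsets and subjects into k samples,
    then pair subject sample i with item subset i."""
    rng = random.Random(seed)
    items = list(items)
    subjects = list(subjects)
    rng.shuffle(items)
    rng.shuffle(subjects)
    # Deal the shuffled lists round-robin so group sizes differ by at most one.
    item_subsets = [items[i::k] for i in range(k)]
    subject_samples = [subjects[i::k] for i in range(k)]
    return list(zip(subject_samples, item_subsets))

# Hypothetical example: 60 items, 30 subjects, k = 6.
plan = matrix_sample(range(1, 61), [f"S{n:02d}" for n in range(1, 31)], k=6, seed=0)
for sample, subset in plan:
    print(len(sample), "subjects take items", sorted(subset))
```

Every item is administered to exactly one sample of subjects, so the full item set is covered even though no subject sees more than one subset.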

Cronbach (1963) and, more recently, Wiley (1970) have suggested that
item-sampling could be a useful technique for classroom evaluation. Within this
framework a wider range of course content objectives could be surveyed more
efficiently, and therefore feedback during formative evaluation concerning
pupil outcomes would be far more comprehensive. If item-sampling were to be
employed within the context of classroom assessment, the advantages regarding
feedback to the instructor seem readily apparent. However, if a function of
the testing is also to provide an index of the relative achievement or mastery
by the student, then questions about the stability and accuracy of scores on an
item-sampled test must be investigated. More directly, how comparable to
traditional testing procedures are scores which are arrived at by totaling the
number of correct responses on different tests using item-sampling procedures,
and would these scores provide a more accurate estimate of summative behavior
than giving all students the same items? Because little empirical research has
been reported that might answer these questions, the present study was initiated
to investigate whether item-sampling as a procedure would yield a more accurate
and stable index of student achievement during formative evaluation when
compared to indices arrived at by the traditional method of assessing pupil
knowledge and understanding.



Method 

Over a two-semester period, 95 graduate students enrolled in three
sections of a course in introductory statistics served as subjects (Ss). Within
each section three in-class multiple choice exams were administered during the
semester. Ss were notified in advance that they were to be tested in the
succeeding class meeting over a given area of content. Each of these three (3)
in-class exams consisted of twenty (20) test items, with at least six different
forms for each of the three exams. Ten randomly selected items on each test
were constant for all forms and were taken by all students, while the remaining
ten items were randomly sampled from an existing item pool to make up the
different forms of a particular exam. Students were then randomly assigned to
a given form. On the various forms of the different exams there was no
indication to the student as to which were the constant items and which may
have been the randomly sampled items. Under this sampling procedure, ten items
on each test were the same for all subjects while random samples of subjects
received random samples of items for the remaining ten items. This procedure
provided three indices for all students on each exam: the number of the ten
constant items answered correctly, the number of the ten randomly sampled items
answered correctly, and the combined number of items answered correctly for the
total 20-item exam.
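
The form-construction procedure described above might be sketched as follows. Several details are assumptions made for illustration and are not stated in the paper: the pool holds 80 items, the constant and sampled items come from the same pool without overlap, each form draws its sampled items independently, and item order is shuffled so that constant items cannot be identified. All function names are hypothetical.

```python
import random

def build_forms(item_pool, n_forms=6, n_constant=10, n_sampled=10, seed=None):
    """Build test forms that share one constant item set and differ in the rest."""
    rng = random.Random(seed)
    pool = list(item_pool)
    constant = rng.sample(pool, n_constant)          # identical on every form
    remaining = [item for item in pool if item not in constant]
    forms = []
    for _ in range(n_forms):
        sampled = rng.sample(remaining, n_sampled)   # drawn separately per form
        order = constant + sampled
        rng.shuffle(order)                           # hide which items are constant
        forms.append({"constant": set(constant), "sampled": set(sampled), "order": order})
    return forms

def assign_forms(students, n_forms, seed=None):
    """Randomly assign each student to one form index."""
    rng = random.Random(seed)
    return {student: rng.randrange(n_forms) for student in students}

def three_scores(form, correct_items):
    """Return (constant correct, sampled correct, total correct) for one student."""
    correct = set(correct_items)
    n_constant = len(form["constant"] & correct)
    n_sampled = len(form["sampled"] & correct)
    return n_constant, n_sampled, n_constant + n_sampled

forms = build_forms(range(1, 81), seed=1)
assignment = assign_forms([f"S{n:02d}" for n in range(1, 96)], len(forms), seed=2)
form = forms[assignment["S01"]]
# Pretend student S01 answered the first 14 items on the form correctly.
print(three_scores(form, correct_items=form["order"][:14]))
```

The three values returned by `three_scores` correspond to the three indices the study recorded for each student on each exam.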

To serve as the criterion measure for the research, a final exam made
up of 60 multiple choice items was administered at the conclusion of the
course. The odd-even split-half reliability of the final exam was .912, while
reliabilities of the shorter exams during the semester ranged from .67 to .75
(see Table 1).
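
For readers unfamiliar with the odd-even split-half coefficient, a minimal sketch follows. The paper does not say whether the half-test correlation was stepped up with the Spearman-Brown formula, so that correction is an assumption here and can be switched off. The response matrix is invented for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def odd_even_split_half(responses, spearman_brown=True):
    """responses: one row per examinee, one 0/1 entry per item, in test order."""
    odd = [sum(row[0::2]) for row in responses]    # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in responses]   # items 2, 4, 6, ...
    r = pearson(odd, even)
    # Spearman-Brown step-up to full test length (assumed, not stated in the paper).
    return 2 * r / (1 + r) if spearman_brown else r

# Hypothetical data: 6 examinees, 8 items.
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0, 0, 0],
]
print(round(odd_even_split_half(responses), 3))
```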

As the primary method of analysis, a step-wise multiple linear
regression was utilized to maximally weight in-class exams for prediction of
the final exam scores. Three regression equations were developed: one using
scores of subjects on the constant items for the three in-class exams as
independent variables, another using scores from the item-sampled portions of
the tests, and the last based on the combined scores for the two ten-item
parts of the total 20-item in-class exams.
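
A step-wise regression of this kind can be sketched in plain Python. The version below is a simple forward-entry procedure that adds, at each step, the exam giving the largest gain in R squared and continues until all predictors are entered. The paper does not report its entry criterion, so this rule is an assumption, and the scores for the eight students are invented.

```python
from math import sqrt

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def r_squared(columns, y):
    """R squared from least-squares regression of y on the columns plus an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve(xtx, xty)
    pred = [sum(bj * xj for bj, xj in zip(beta, row)) for row in X]
    mean_y = sum(y) / n
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def forward_stepwise(predictors, y):
    """Return (name, multiple R, R squared, increment) for each entry step."""
    remaining = list(predictors)
    entered, steps, previous = [], [], 0.0
    while remaining:
        def rsq_with(name):
            return r_squared([predictors[n] for n in entered + [name]], y)
        best = max(remaining, key=rsq_with)
        rsq = rsq_with(best)
        entered.append(best)
        remaining.remove(best)
        steps.append((best, sqrt(max(rsq, 0.0)), rsq, rsq - previous))
        previous = rsq
    return steps

# Hypothetical scores for 8 students on three in-class exams and the final.
predictors = {
    "Exam I": [7, 5, 9, 6, 8, 4, 10, 6],
    "Exam II": [8, 6, 9, 5, 9, 5, 9, 7],
    "Exam III": [6, 5, 10, 5, 8, 3, 9, 7],
}
final = [42, 33, 55, 31, 49, 25, 56, 40]
for name, r, rsq, inc in forward_stepwise(predictors, final):
    print(f"{name:8s}  R={r:.3f}  RSQ={rsq:.3f}  increment={inc:.3f}")
```

Running the same procedure three times, once each on constant, sampled, and total scores, would give three step tables of the kind the study compares.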

Results 

Table 1 presents the means, intercorrelations, and split-half
reliabilities of the constant, sampled and total test items obtained over all
subjects for each of the three test administrations. It was assumed that
sampled test items making up a given test form would represent a parallel
form of the constant items for a given administration. As can be noted in
Table 1, the moderate intercorrelations between sampled and constant items for
the three test situations do not tend to encourage this assumption, but
calculation of Hotelling's T² statistic for correlated data comparing mean
vectors for sampled and constant items indicated that the differences in
group performance were not statistically significant. A canonical analysis
also was used to assess the degree of shared content variability between
scores on the sampled and constant test items. A canonical correlation of
.78 was found on the first and only significant canonical factor extracted.
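
The paper does not give its computation of Hotelling's T² for correlated data. One common form, assumed here, works on the within-subject differences between the two score vectors: T² = n d'S⁻¹d, where d is the mean difference vector and S is the covariance matrix of the differences, with (n - p) / (p(n - 1)) T² referred to an F distribution on (p, n - p) degrees of freedom. The sketch below computes the statistic only, not a p-value, and the scores are invented.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def hotelling_t2_paired(x, y):
    """x, y: n rows of p scores each, measured on the same n subjects.
    Returns (T squared, F statistic, (df1, df2))."""
    n, p = len(x), len(x[0])
    d = [[a - b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]
    dbar = [sum(row[j] for row in d) / n for j in range(p)]
    # Sample covariance matrix of the difference scores.
    S = [[sum((row[a] - dbar[a]) * (row[b] - dbar[b]) for row in d) / (n - 1)
          for b in range(p)] for a in range(p)]
    w = solve(S, dbar)                       # S^-1 times the mean difference
    t2 = n * sum(di * wi for di, wi in zip(dbar, w))
    f = (n - p) / (p * (n - 1)) * t2
    return t2, f, (p, n - p)

# Hypothetical scores for 6 students: constant and sampled parts of three exams.
constant = [[7, 8, 7], [5, 9, 6], [9, 7, 8], [6, 6, 5], [8, 10, 9], [4, 7, 6]]
sampled = [[8, 7, 6], [5, 8, 7], [8, 8, 9], [7, 5, 5], [7, 9, 8], [5, 8, 4]]
t2, f, df = hotelling_t2_paired(constant, sampled)
print(f"T2={t2:.3f}  F={f:.3f}  df={df}")
```

With a single measure (p = 1) the statistic reduces to the square of the ordinary paired t.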



Insert Table 1



The results of the multiple linear regression used to predict
performance on the final examination are presented in Table 2. Using constant
item scores, item-sampled scores, and combined total scores as predictor
variables, the multiple correlations were .776, .834, and .854 respectively.
Each of these correlations differs significantly from zero (p < .01). To test
the difference between pairs of multiple R's, an intercorrelation matrix with
correlations between predicted scores based on the three multiple regression
equations was obtained. A significant difference (p < .05, one-tailed test)
was found between multiple R's when using total scores as compared to constant
item scores as predictors (.854 vs. .776). All other differences were not
statistically significant.
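
As a quick arithmetic aid, the three multiple correlations reported above can be squared to give the proportion of final-exam variance each predictor set accounts for. The values computed from the rounded R's agree with the RSQ column of Table 2 to within rounding, which suggests the table was presumably computed from unrounded correlations.

```python
# Multiple correlations as reported in the text.
multiple_r = {"constant": 0.776, "sampled": 0.834, "total": 0.854}

# Proportion of variance in final exam scores accounted for by each predictor set.
variance_explained = {name: round(r ** 2, 3) for name, r in multiple_r.items()}
print(variance_explained)
# {'constant': 0.602, 'sampled': 0.696, 'total': 0.729}
```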
Insert Table 2



Discussion and Conclusion

Results of this research have indicated that item-sampling as a
method for measuring classroom achievement provides more precise information,
although not significantly so, than tests of the same length constructed
in the traditional manner. The present investigation demonstrated that test
results arrived at by a systematic process of item-sampling might provide
a more accurate index of an individual's true score. The results have indicated
that item-sampling can be employed for classroom evaluation without the fear
that the procedure itself will detract from some estimate of an individual's
performance. The implication is that item-sampling can provide feedback to
the instructor, giving him the opportunity for observing pupil outcomes and
understandings over a greater range of content objectives within the same time
limits that typically provide for a narrower sampling of course related objectives
by way of traditional testing methods.

It should be noted that, for the sampling in this experiment, there was
no statistically significant difference in the predictability of the
item-sampled test and the combined total test, which was twice as long, whereas a
statistically significant difference did exist between the predictability of
the longer test and the shorter constant item test. It may be that
item-sampling, in addition to covering a greater range of content objectives, can
do so with a fewer number of items per test without losing a significant
amount of predictive power.

Table 1

Means, Standard Deviations, Intercorrelations,
and Split-Half Reliabilities for Constant Items,
Sampled Items, Total Test, and Final Examination

Exam    Items        Mean    S.D.    Sampled   Total   Final   rtt

I       Constant     7.23    2.12     .61       .91     .68
        Sampled      7.22    1.85               .88     .61
        Total       14.45    3.57                       .72    .75

II      Constant     8.03    1.72     .50       .85     .58
        Sampled      7.74    1.91               .88     .69    .67
        Total       15.77    3.15                       .74

III     Constant     7.28    1.74     .50       .84     .5a
        Sampled      7.09    2.08               .89     .70
        Total       14.38    3.31                       .73    .67

Final               40.16    8.52                              .91

Table 2

Step-wise Multiple Correlations in
Predicting Final Exam Performance

Order of Entering    Exam             R       RSQ     Increment

1st                  Constant I       .675    .456    .456
2nd                  Constant III     .757    .573    .117
3rd                  Constant II      .776    .603    .030

1st                  Sampled III      .703    .494    .494
2nd                  Sampled II       .810    .656    .162
3rd                  Sampled I        .834    .696    .040

1st                  Total III        .734    .540    .540
2nd                  Total I          .824    .680    .140
3rd                  Total II         .854    .730    .050